ABSTRACT


We consider the problem of optimally assigning the modules of a parallel/pipelined pro- 
gram over the processors of a multiple computer system under certain restrictions on the inter- 
connection structure of the program as well as the multiple computer system. We show that 
for a variety of such programs it is possible to find in linear time if a partition of the program 
exists in which the load on any processor is within a certain bound. This method, when com- 
bined with a binary search over a finite range, provides an approximate solution to the parti- 
tioning problem. 

The specific problems we consider are partitioning of (1) a chain structured parallel
program over a chain-like computer system, (2) multiple chain-like programs over a
host-satellite system, and (3) a tree structured parallel program over a host-satellite system.

For a problem with m modules and n processors, the complexity of our algorithm is no
worse than $O(mn \log(W_T/\epsilon))$, where $W_T$ is the cost of assigning all modules to one
processor and $\epsilon$ the desired accuracy.


Supported by NASA Contracts NAS1-17070 and NAS1-18107 while the author was in residence at the In- 
stitute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center. 
1. Introduction

With the proliferation of relatively cheap parallel computers in the research as well as 
the commercial field, it is becoming increasingly important to efficiently utilize the powerful 
hardware. One important requirement is that the task being executed be partitioned over the 
multiple computer system in an optimal fashion so as to minimize the total execution time of 
the job. In general the problem of finding the optimal partition of an arbitrarily connected 
distributed/parallel program over an arbitrarily connected multiple computer system is very 
difficult. If, however, the modules of the program communicate in a restricted manner and the
multiple computer system has a special structure then it is possible to solve some of the parti- 
tioning problems. It has been shown in [1] that if the number of processors is limited to 2 
then the partitioning problem can be solved efficiently for a distributed processor system. 
Similarly if the interconnection structure of the distributed program is tree-like then it is possi- 
ble to efficiently partition the program over any number of processors [2]. 

The problem of optimally partitioning a modular program in a parallel processing 
environment is discussed in [3]. If the interconnection structure of the program is chain or 
tree-like and the parallel processor is either connected as a chain or is a host-satellite system, 
then [3] shows how the program may be partitioned optimally in polynomial time. Other 
related research in this field, which includes sub-optimal or approximate solutions to the parti- 
tioning problem, is reported in [4], [5] and [6]. 

In this paper we describe a fully polynomial time approximation scheme which provides 
approximate solutions to most of the partitioning problems discussed in [3] and already solved 
by using pure polynomial algorithms. In order to appreciate the usefulness of the approximate 
solutions, one should bear in mind that data for the problem being solved is often only known 
approximately. Hence an approximate solution may be as meaningful as an exact solution for 
many of the practical problems [4] where the extra accuracy of the exact solution is not 
needed and where the approximate solution can be obtained in a relatively short time [7]. 

In Section 2 we discuss an algorithm for finding the optimal partition of a chain struc- 
tured parallel or pipelined program over a chain of identical processors. We assume that the 
program is made up of m modules numbered 1..m and has an intercommunication pattern such
that module i can communicate only with modules i+1 and i-1. Similarly we assume that the 
multiprocessor of size n<m has also a chain like architecture. We work under the constraint 
that each processor has a contiguous subchain of program modules assigned to it. Thus the 
partitions of the chains have to be such that modules i and i+1 are assigned to the same or 
adjacent processors. The optimal partitioning would then be the assignment of subchains of 
program modules to processors that minimizes the load on the most heavily loaded processor. 

The central result of Section 2 is that for a trial weight w, it is possible to find if a parti- 
tion exists in which the load on each processor is less than or equal to w in time proportional 
to $O(mn)$. The optimal partition is then found by making a binary search in a given range to
find the partition for which w is minimum. The approach we use here is a fully polynomial 
time approximation scheme [8] and is an extension of the method discussed in [4] for finding 
the optimal partition of a one dimensional domain over a chain of processors. 

In Sections 3 and 4 we show that this technique can be used to optimally partition programs
composed of multiple chains over a multiple computer system based on a single-host
and multiple-satellite architecture. The time required by the entire system to complete the 
processing is determined by the greater of (1) the individual load on the most heavily loaded
satellite and (2) the sum of the collective loads on the host. 

We discuss the partitioning of a chain-like program over a shared memory system in 
Section 5. In such a system, the total processing time is determined by the greater of (1) the 
individual load on the most heavily loaded processor and (2) the sum of communication costs 
between all pairs of processors that communicate through the shared memory [3]. We use 
Kernighan's approach [10] to design a probing function. The probing function returns true if
it is possible to partition the program such that the load on any processor and the total com- 
munication cost is less than or equal to a trial weight w, and false otherwise. The optimal par- 
tition is then found by making a binary search in a finite range. 

In Section 6 we discuss an approach to optimally partition a tree-structured parallel or 
pipelined program over a single-host multiple-satellite system. This algorithm is also based 
on a probing function. 

The paper concludes with a discussion of our results in Section 7. 


2. An Algorithm for Partitioning Chains 

We describe in this section how a chain structured parallel or pipelined program can be 
optimally partitioned over a chain of identical processors. We assume that a chain structured 
program is made up of m modules numbered 1..m and has an intercommunication pattern such
that module i can communicate only with modules i + 1 and i-1. Similarly we assume that the 
multiprocessor of size n<m has also a chain like architecture. 

We work under the constraint that each processor has a contiguous subchain of program 
modules assigned to it. Thus the partitions of the chains have to be such that modules i and 
i+1 are assigned to the same or adjacent processors. The optimal partitioning would then be
the assignment of subchains of program modules to processors that minimizes the load on the 
most heavily loaded processor. 

It is convenient for us to assume that should the optimal assignment dictate that fewer 
than the available n processors be used, we can simply ignore the impact of communicating 
with the outside world through a subchain of unused processors. If desired, it is very simple 
to account for this overhead by concatenating dummy modules to the chain. 

Notation: 

$w_i$   time consumed in the execution of module i.

$c_i$   time to communicate between modules i and i+1 if the two modules are assigned
        to different processors. We assume that $c_0$ ($c_m$) is the time for module 1 (m) to
        communicate with the outside world.

Definitions: 

$w_{\max}$   maximum value of $w_i$ for $1 \le i \le m$.

$W_T$   load on a processor if all the m modules are assigned to it. Thus

        $$W_T = \sum_{i=1}^{m} w_i + c_0 + c_m.$$

$\Omega_{j,k}$   load on a processor if subchain Modules[j..k] is assigned to it. It is given by

        $$\Omega_{j,k} = \sum_{i=j}^{k} w_i + c_k + c_{j-1}.$$

$\Delta_k$   total remaining load to be assigned given that module k is the last module
        assigned to a processor. This is given by the following equation:

        $$\Delta_k = \sum_{i=k+1}^{m} w_i + c_k + c_m.$$

weight   The weight of a partition is the weight of its heaviest subchain.

The assignment algorithm, based on the function PROBE1 (described below), takes
advantage of the fact that the load assigned to the most heavily loaded processor in the
optimal partition lies somewhere between $W_T/n$ and $W_T$. This is because at worst all modules
are assigned to the same processor, which has load $W_T$. At best the load is uniformly distributed
over all processors with $W_T/n$ on each. The algorithm selects a trial weight w in the above
range and then uses the function PROBE1. The function PROBE1(w) returns true if it is
possible to partition the chain of modules into subchains such that the load on each processor is
less than or equal to w, and false otherwise. The partition that function PROBE1 obtains is
called a conservative partition.

2.1. The Algorithm 

function PROBE1(Processors[1..n], Modules[1..m], w): boolean;
begin
    j = 1; k = 0; p = 1; $\Delta_{\min}$ = $W_T$;
    while p $\le$ n do
    begin
        for x = j to m do
            if $\Omega_{j,x} \le w$ and $\Delta_x < \Delta_{\min}$ then
            begin
                $\Delta_{\min}$ = $\Delta_x$;
                k = x;
                Assign subchain Modules[j..k] to processor p;
                if k = m then return(true);
            end;
        j = k+1; p = p+1;
    end;
    return(false);
end.
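
As an illustration only, the probing step can be sketched in Python. This is a minimal sketch under stated assumptions, not the original implementation: the names `probe1`, `omega` and `delta`, the 0-based list layout, and the sample costs are all hypothetical. It assumes the subchain load is the sum of execution costs plus the two boundary communication costs, and the remaining load is the cost of everything after the cut plus the cut edge and the final edge, as defined in the notation above. The running minimum of the remaining load is carried from one processor to the next, as in the pseudocode.

```python
def probe1(w_exec, c, n, w):
    """Return a list of (first, last) module ranges (1-based) if a
    conservative partition of weight <= w is found, else None.

    w_exec[i-1] is the execution cost of module i (i = 1..m);
    c[i] is the communication cost between modules i and i+1,
    with c[0] and c[m] the costs of talking to the outside world.
    """
    m = len(w_exec)
    prefix = [0]
    for cost in w_exec:
        prefix.append(prefix[-1] + cost)

    def omega(j, k):
        # Load on a processor holding Modules[j..k].
        return prefix[k] - prefix[j - 1] + c[j - 1] + c[k]

    def delta(k):
        # Remaining load when module k is the last one assigned so far.
        return prefix[m] - prefix[k] + c[k] + c[m]

    j, k = 1, 0
    delta_min = prefix[m] + c[0] + c[m]  # W_T: nothing assigned yet
    subchains = []
    for _ in range(n):
        for x in range(j, m + 1):
            if omega(j, x) <= w and delta(x) < delta_min:
                delta_min, k = delta(x), x
        if k < j:
            return None  # no admissible subchain starts at module j
        subchains.append((j, k))
        if k == m:
            return subchains
        j = k + 1
    return None


# Hypothetical 5-module chain on 3 processors (costs invented for illustration).
costs = [4, 6, 3, 5, 2]
comm = [0, 2, 1, 3, 1, 0]
print(probe1(costs, comm, 3, 10))
print(probe1(costs, comm, 3, 11))
```

Each processor scans the remaining modules once, so a single probe costs on the order of mn evaluations when prefix sums are used for the loads.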

2.2. Discussion 

In order to understand the working of the function PROBE1 it is important to note that:

1. The function assigns Modules[j..k] to processor p such that $\Delta_k$ is minimum and
   $\Omega_{j,k} \le w$ for $j \le k \le m$. The corresponding minimum value of $\Delta_k$ is denoted by $\Delta_{\min}$.

2. If while assigning Modules[k+1..k'] to processor p+1 it is found that $c_k + w_{k+1} > w$ then
   obviously it is not possible to have a conservative partition of weight w and the function
   PROBE1 returns false.

3. If while assigning Modules[k+1..x] to processor p+1, it is found that $\Delta_x > \Delta_{\min}$ when
   $\Omega_{k+1,x} \le w$ for $k+1 \le x \le m$ then the function again returns false. If instead the function
   assigns some modules to processor p+1 then either the condition $\Omega_{k+1,x} \le w$ will be
   violated or the function will encounter a situation similar to (2). This can be explained
   with the help of Fig. 1. Suppose $\Omega_{k+1,x} \le w$ and $\Delta_x > \Delta_{\min}$ for $k+1 \le x \le q$, while $\Omega_{k+1,x} > w$
   and $\Delta_x < \Delta_{\min}$ for $x > q$. To make sure that the load on processor p+1 is less than or equal
   to w the value of x should be equal to or less than q. But then $\Omega_{x+1,q+1}$ will become
   larger than $\Omega_{k+1,q+1}$ because $\Delta_x > \Delta_{\min}$, and thus we are led to a situation similar to (2).
   Thus if k=0, so that p+1 is the first processor, and $\Omega_{1,x} > W_T$ for $x > q$, then the optimal partition
   will require that all m modules be placed on a processor; otherwise, the load on some
   processors will become larger than $W_T$.

In the following paragraphs we shall first examine a detailed example of how the function
PROBE1 works and then prove that if there exists a partition of weight w then the function
PROBE1 will always find that or a better assignment.

2.3. An Example 

Before proving correctness, let us examine a detailed example of how the function
PROBE1 tries to find a conservative partition of weight w. Fig. 2 shows a 10 module chain to
be mapped on a 4 processor chain with a trial weight w=20. The number below each module
is its execution cost while the number above each edge is the communication cost for the two
modules at the ends of that edge. Thus for module 4 the values are $w_4=6$ and $c_4=12$. The
value of $W_T$ for this problem is 54 (we have assumed that $c_0=c_m=0$).

Fig. 3 (bottom) shows a plot (grey line) of $\Omega_{1,x}$ for processor 1 against x. It can be seen
from the figure that the load on processor 1 increases from zero, when no module is assigned
to it, to $W_T$ when all 10 modules are assigned to processor 1. The value of the remaining load $\Delta_x$
is also plotted (black line) against x. This decreases from $W_T$, when no module is assigned to
processor 1, to zero when all the 10 modules are assigned to it. It is evident from Fig. 3 that
the rise of $\Omega_{1,x}$ and the fall of $\Delta_x$ with x are not monotonic. Thus if we initially assign
Modules[1..2] to processor 1 and then further assign Modules[3..4] to it then the value of $\Delta_x$,
instead of decreasing, increases from 40 to 44. The reason for this behavior is non-uniform
communication costs.

The subchain Modules[1..x] is assigned to processor 1 such that $\Delta_x$ is minimum and
$\Omega_{1,x} \le 20$. The value of x which satisfies the above constraints is 2. It can be seen from Fig. 3
that the condition ($\Omega_{1,x} \le 20$) will still be satisfied if we further assign module 3 to processor 1,
but then the remaining load will increase, which will make it impossible to find a conservative
partition of weight w=20 for the rest of the module chain in this example. The resulting
assignment of modules to processor 1 is shown in Fig. 3 (top).

Having assigned modules to processor 1 we plot the load on processor 2, $\Omega_{3,x}$, (grey
line) and the remaining load $\Delta_x$ (black line) as a function of x in Fig. 4. The resulting
assignment of modules to processor 2, shown in Fig. 4 (top), is selected on the same basis
that $\Omega_{3,x} \le 20$ and $\Delta_x$ should be minimum. Finally we draw $\Omega_{6,x}$ and $\Delta_x$ for processor 3 in
Fig. 5. For the assignment of modules to processor 3, shown in Fig. 5 (top), the remaining load is
just equal to the trial weight and thus it is possible to find a conservative partition of weight
w of the module chain in the above example.

2.4. Proof of Correctness 

Claim   If a problem with m modules and n processors has a partition of weight w,
        PROBE1 will find that or a partition of less weight.

Proof   By induction on n.

        Consider the case n=2. Suppose the given partition of weight w assigns
        Modules[1..$j_g$] to processor 1 and Modules[$j_g$+1..m] to processor 2.

        Apply PROBE1 to this problem. Suppose it assigns Modules[1..$j_c$] to processor 1
        and Modules[$j_c$+1..m] to processor 2. Because of the way in which PROBE1 proceeds, $\Delta_{j_c}$
        will be minimized under the constraint $\Omega_{1,j_c} \le w$. But because n=2,

        $$\Delta_{j_c} = \sum_{i=j_c+1}^{m} w_i + c_{j_c} + c_m = \Omega_{j_c+1,m} = \text{the weight of the second subchain.}$$

        Thus the weight of the second subchain will be minimized under the constraint
        that the weight of the first subchain is $\le w$. If there exists a partition in which the
        weight of both subchains is $\le w$, PROBE1 will clearly find it. The claim is thus
        true for n=2. Note that the proof is independent of m.

        We will now show that if the claim is correct for n=k it is also correct for n=k+1.
        Suppose we are given a chain of m modules which has a partition of weight w.
        Assume that in this given partition, Modules[1..$j_g$] are assigned to processor 1 and
        Modules[$j_g$+1..$h_g$] to processor 2.

        Starting with module 1, scan the modules from left to right to identify the module
        $j_c$ such that $\Omega_{1,j_c} \le w$ and $\Delta_{j_c}$ is minimum. Delete the nodes [1..$j_c$].

        Three cases are now possible:

        Case (1) $j_c = j_g$: the subchain deleted corresponds to the first subchain of the given
                 partition. In this case the remaining nodes [$j_c$+1..m] must have a partition
                 with weight w and n=k subchains (because the original chain was given
                 with n=k+1 subchains).

        Case (2) $j_c < j_g$: this means that $\Delta_{j_c} \le \Delta_{j_g}$, which implies that $\Omega_{j_c+1,h_g} \le \Omega_{j_g+1,h_g} \le w$, i.e. the
                 second subchain of the given partition has had its weight reduced below w.

        Case (3) $j_c > j_g$: again only possible if $\Delta_{j_c} \le \Delta_{j_g}$, which implies that $\Omega_{j_c+1,h_g} \le \Omega_{j_g+1,h_g} \le w$, i.e. the
                 second subchain has had its weight reduced.

        In all three cases the remaining nodes [$j_c$+1..m] must have a partition with weight
        w and n=k subchains.

        By applying PROBE1 to the remaining chain $j_c$+1..m, we can obtain a partition of
        weight w and n=k subchains (since the algorithm is assumed correct for n=k).
        By concatenating the deleted chain 1..$j_c$ with the partition obtained above, we will
        get a partition of weight w and n=k+1 subchains.

        Now recall that node $j_c$ was selected under the constraint that $\Omega_{1,j_c} \le w$ and $\Delta_{j_c}$ was
        minimum. Thus the deletion of 1..$j_c$ followed by the application of PROBE1 is
        equivalent to the application of PROBE1 for n=k+1. This proves that if the claim
        is true for n=k it is also true for n=k+1.

        We have already proved the claim to be true for n=2. It is therefore true for all
        n.


The algorithm makes a binary search in the range [$W_T/n$, $W_T$] using the function PROBE1
to find the partition for which the weight of the heaviest subchain is minimum. For each trial
weight w the function PROBE1 has to look at each module only once for each processor.
Thus for m modules and n processors the function PROBE1 will perform $O(mn)$ steps to find
a conservative partition, if it exists. If the above range is resolved to an accuracy of $\epsilon$ then
the algorithm will find a conservative partition of weight w in time proportional to
$O(mn \log_2(W_T/\epsilon))$ with the assurance that w exceeds the weight of the heaviest subchain
in the optimal assignment by no more than $\epsilon$. Thus the order of the algorithm is $O(mn \log_2(W_T/\epsilon))$. It
is important to note that the time complexity of the algorithm is proportional to $\log(W_T/\epsilon)$,
unlike other fully polynomial time approximation schemes which are polynomial in $1/\epsilon$ [8].
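
The binary search over trial weights can be sketched as follows. This is a minimal illustration, assuming only that the probing function is available as a callable that is true for every weight at or above the smallest achievable one; the name `search_min_weight` and the stand-in predicate in the usage lines are hypothetical.

```python
def search_min_weight(probe, lo, hi, eps):
    """Bisect [lo, hi] for the smallest trial weight accepted by `probe`.

    Assumes probe(hi) is true and that probe is monotone (once true,
    true for all larger weights). Returns a weight that exceeds the
    smallest accepted weight by at most eps.
    """
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if probe(mid):
            hi = mid  # a partition of weight <= mid exists; tighten from above
        else:
            lo = mid  # no such partition; the optimum lies above mid
    return hi


# Stand-in probe for illustration: pretend the optimal weight is 11
# for a chain with W_T = 20 on n = 3 processors.
W_T, n = 20.0, 3
approx = search_min_weight(lambda w: w >= 11, W_T / n, W_T, 0.01)
print(approx)
```

The loop halves the interval each time, so the number of probes grows with the logarithm of the interval width divided by the accuracy, which is the source of the logarithmic factor in the bound above.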


3. Partitioning Multiple Chains across a Host-Satellite System 

The algorithm presented in the previous section can be used to solve several other
partitioning problems in Host-Satellite Systems as shown in Fig. 6. Let us assume that each chain
has m modules, that there are n satellites, and that for each module i of satellite s we are given the time
required to run it on the satellite, $e_{i,s}$, and on the host, $h_{i,s}$. For each pair of modules i and i+1 from
satellite s we have the time required for interprocessor communication, $c_{i,s}$, should i be
assigned to the satellite and i+1 to the host. When these n chains are partitioned between the
host and the n satellites, the time required by the entire system to complete the processing is
determined by the greater of (1) the individual load on the most heavily loaded satellite and
(2) the sum of the collective loads on the host.

We represent the load on satellite s by $\Omega_{k,s}$, provided Modules[1..k] are assigned to it. It
is given by $\sum_{i=1}^{k} e_{i,s} + c_{k,s}$. The remaining load, due to the rest of the modules of satellite s, is
assigned to the host and is denoted by $\Delta_s$, which is equal to $\sum_{i=k+1}^{m} h_{i,s} + c_{k,s}$.

The probing function, while trying to find a conservative partition of weight w, will
assign Modules[1..k] to satellite p such that $\Omega_{k,p} \le w$ and $\Delta_p$ is minimum. If the total load on
the host, which is equal to $\sum_{s=1}^{n} \Delta_s$, is less than or equal to w then the function returns true and
the conservative partition, and false otherwise.
3.1. An Example

Let us consider an example of two satellites each having 10 modules. For the sake of
simplicity assume that $e_{i,s} = h_{i,s}$. The two satellite chains are shown in Fig. 7 (top). $\Omega_{k,1}$ (the
load on satellite 1) and $\Delta_1$ (the remaining load on the host) are plotted against k in grey and
black lines respectively in Fig. 7 (middle). In Fig. 7 (bottom) we plot the corresponding
values of $\Omega_{k,2}$ and $\Delta_2$ for satellite 2, against k. The total load on the host is the sum of $\Delta_1$ and
$\Delta_2$. For example if k=5 for satellite 1, and k=8 for satellite 2 then $\Omega_{5,1}$ and $\Omega_{8,2}$ will be 33
and 40 respectively and the total load on the host will be equal to 29+16=45.

For a trial weight w=44, the probing function will select k=7 for satellite 1. In this case
$\Delta_1$ will be 20. Note that for any other value of k either the value of $\Omega_{k,1}$ is larger than w or
$\Delta_1$ is not as small as 20, as shown in Fig. 7 (middle). Similarly for satellite 2, the selected
value of k is 8 and so $\Delta_2$ will be 16. The total load on the host will then become 20+16=36,
which means that a conservative partition with w=44 exists as shown (bold line) in Fig. 7
(top).

Thus for each satellite s, the probing function selects k such that $\Omega_{k,s} \le w$ and $\Delta_s$ is minimum. All
satellite chains are independent of each other. Thus if $\Delta_p$ is minimum for each p, where
$1 \le p \le n$, then $\sum_{p=1}^{n} \Delta_p$ will also be minimum. Now if there exists a partition in which the total
processing time is less than or equal to w then for each individual satellite s, this partition can
always be transformed into a conservative partition of weight w by increasing or decreasing k
until $\Delta_s$ becomes minimum and $\Omega_{k,s} \le w$. The only result of this transformation will be that the
load on the host will either decrease or remain the same. Thus if there exists a partition of
weight w then this approach will always find that or an equivalent assignment.
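
A minimal Python sketch of this probing function is given below. It is an illustration under stated assumptions rather than the original code: the function name `probe_host_satellite`, the list-of-lists layout, and the sample costs are hypothetical. It also assumes that k=0 (the whole chain on the host) is an admissible choice and that the caller supplies boundary communication costs for k=0 and k=m.

```python
def probe_host_satellite(e, h, c, w):
    """Return the chosen cut k for each satellite chain, or None.

    e[s][i-1], h[s][i-1]: cost of module i of chain s on the satellite
    and on the host. c[s][k]: communication cost when Modules[1..k] of
    chain s sit on the satellite and the rest on the host (k = 0..m).
    """
    cuts = []
    host_load = 0
    for e_s, h_s, c_s in zip(e, h, c):
        m = len(e_s)
        best_k, best_delta = None, None
        sat = 0            # execution cost placed on the satellite so far
        rest = sum(h_s)    # execution cost left for the host
        for k in range(m + 1):
            if k > 0:
                sat += e_s[k - 1]
                rest -= h_s[k - 1]
            omega = sat + c_s[k]    # satellite load for this cut
            delta = rest + c_s[k]   # load this chain leaves on the host
            if omega <= w and (best_delta is None or delta < best_delta):
                best_k, best_delta = k, delta
        if best_k is None:
            return None
        cuts.append(best_k)
        host_load += best_delta
    return cuts if host_load <= w else None


# Two hypothetical 3-module chains with equal host and satellite costs.
costs = [[3, 2, 4], [5, 1, 1]]
comm = [[0, 1, 2, 0], [0, 2, 1, 0]]
print(probe_host_satellite(costs, costs, comm, 7))
print(probe_host_satellite(costs, costs, comm, 5))
```

Because the chains are independent, each one is scanned once, and only the final sum over chains is compared with the trial weight.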

The algorithm makes a binary search in the range [$W_T/n$, $W_T$], where $W_T$ is given by
equation (2), using the probing function to find the conservative partition of weight w for which w
is minimum.

$$W_T = \min\left( \sum_{s=1}^{n} \sum_{i=1}^{m} h_{i,s},\; \max_{s=1,\ldots,n} \sum_{i=1}^{m} e_{i,s} \right) \qquad (2)$$

Note that $W_T$ is the smaller of (1) total processing time if all modules are assigned to the
host and (2) total processing time if no module is assigned to the host. Thus if $w_{\max}$ is the
maximum value of $h_{i,s}$ for $1 \le i \le m$ and $1 \le s \le n$ then $W_T \le mn\,w_{\max}$.

For each trial weight w the probing function has to look at each module at least once for
each satellite before a decision is made to assign this module to the satellite or not. Thus the
function will perform $O(mn)$ steps to find a conservative partition of weight w if it exists. If
the range [$W_T/n$, $W_T$] is resolved to an accuracy of $\epsilon$ then the algorithm will find a conservative
partition of weight w in time proportional to $O(mn \log_2(W_T/\epsilon))$ with the assurance that w
exceeds the larger of the worst load on any satellite and the total load on the host in the optimal
assignment by no more than $\epsilon$.
4. Partitioning Distributed Programs in Host-Satellite System

Stone has solved the problem of partitioning a distributed program over a single-host and 
single-satellite system in [1]. He has further studied the behavior of the optimal assignment as 
a function of load on the host and shows that a nesting property holds [9]. As load increases 
on the host, modules move away from the host and onto the satellite. At no point does an 
increase in load cause a module to move from the satellite onto the host.

Subsequent work by Bokhari [3] shows how this property can be exploited when 
finding optimal assignments in a single-host multiple-satellite system. We can consider the 
individual programs to have chain-like structure, regardless of their actual interconnection. 
The optimal assignment can be found using a Sum-Bottleneck path algorithm. The complexity 
of this approach is dominated by the $O(m^4 n)$ algorithm that finds the individual chains, for a
problem with n satellites, each executing a program with m modules. The partitioning of the
chains takes far less than $O(m^4 n)$ time.

We can find the partitioning of the chains using an approach similar to the one discussed
in the previous section. This takes $O(mn \log(W_T/\epsilon))$ time, where $W_T$ and $\epsilon$ are as defined in the
previous section. The overall complexity of the algorithm is still $O(m^4 n)$.


5. Partitioning Chains in Shared Memory System 

Consider a chain of m modules numbered 1..m. Each module i has an associated execution
cost $w_i$ and each edge (i,j) has a communication cost $c_{i,j}$, should modules i and j be
placed on different processors in a shared memory system. Under such a system, the total
processing time is determined by the greater of (1) the individual execution load on the most
heavily loaded processor and (2) the sum of communication costs between all pairs of processors
that communicate through the shared memory [3]. It has been shown by Kernighan [10]
that it is possible to find a partition of an m module chain into n disjoint subsets such that the
size of each subset is less than or equal to a given constant and the sum of costs on edges
joining nodes in different subsets is minimum in time proportional to $O(m)$. Using this
approach we can design a probing function which can find if a partition of the chain-like
program exists in which the load on any processor is less than or equal to a trial weight w. If, in
the resulting partition, the sum of communication costs on edges joining modules on different
processors is less than or equal to w then the probing function returns true, and false otherwise.
We can then make a binary search in the range [$\sum_{i=1}^{m} w_i / n$, $\sum_{i=1}^{m} w_i$] using the probing function
to find the partition for which w is minimum, assuming that there are n processors in the
shared memory system.
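
To make the shape of such a probing function concrete, here is a minimal Python sketch. It deliberately swaps in a plain dynamic program in place of Kernighan's linear-time method, so it is slower than the bound quoted above and is not the algorithm of [10]. The name `probe_shared_memory`, the data layout, and the sample costs are hypothetical, and only edges between neighbouring modules are assumed to exist.

```python
import math


def probe_shared_memory(w_exec, c, n, w):
    """True if the chain can be split into at most n contiguous blocks,
    each with execution load <= w, whose cut edges cost <= w in total.

    w_exec[i-1] is the execution cost of module i; c[i-1] is the cost of
    the edge between modules i and i+1 (i = 1..m-1).
    """
    m = len(w_exec)
    prefix = [0]
    for cost in w_exec:
        prefix.append(prefix[-1] + cost)

    # best[p][k]: least total cut cost placing modules 1..k in p blocks.
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for p in range(1, n + 1):
        for k in range(1, m + 1):
            for j in range(k):  # last block holds modules j+1..k
                if prefix[k] - prefix[j] > w:
                    continue
                cut = c[j - 1] if j > 0 else 0
                cand = best[p - 1][j] + cut
                if cand < best[p][k]:
                    best[p][k] = cand
    min_cut = min(best[p][m] for p in range(1, n + 1))
    return min_cut <= w


# Hypothetical 5-module chain on 3 processors.
print(probe_shared_memory([4, 6, 3, 5, 2], [2, 1, 3, 1], 3, 9))
print(probe_shared_memory([4, 6, 3, 5, 2], [2, 1, 3, 1], 3, 8))
```

The design choice here is clarity over speed: the table makes the two conditions (block load and total cut cost) explicit, and it can be replaced by a faster routine without changing the probe's interface.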


6. Partitioning Trees in Host-Satellite System 

We consider the problem of partitioning a tree structured pipelined or parallel program 
over a single-host, multiple-satellite system as shown in Fig. 10. We assume that there are m 
nodes in the program tree and there are as many satellites as the number of leaf nodes of the 
tree. We work under the constraints that (1) individual maximal subtrees of the given tree are 
assigned to each satellite and (2) that the root is always assigned to the host. The total 
processing time under such a system will be the larger of the load on the host and the worst
load on any satellite. 

Notation: 

$h_i$   time consumed in the execution of module i on the host.

$e_i$   time consumed in the execution of module i on any satellite (all satellites are
        similar).

$c_i$   time required for communication if node i is assigned to a satellite and node
        father(i) to the host.


Definitions: 

$w_{\max}$   maximum value of $h_i$ for $1 \le i \le m$.

load(i)   load on a satellite if node i and all its children nodes are assigned to the
        satellite. This is equal to the sum of individual $e_i$'s of the modules assigned to the
        satellite plus $c_i$.

cost(i)   cost of execution of node i and its children on the host. This is equal to the
        sum of individual $h_i$'s of node i and its children.

$W_T$   load on the host if all m nodes of the program tree are assigned to the host.
        This is equal to $\sum_{i=1}^{m} h_i$.

depth(i)   distance of node i from the root. The value of depth(root)=0.

$d_{\max}$   the maximum value of depth(i) for $1 \le i \le m$.

Host_Load   total load on the host. If the program tree is partitioned over a host and n
        satellites then for each satellite p there will be a node $\pi(p)$ assigned to the
        satellite while father($\pi(p)$) will be assigned to the host. The value of
        Host_Load will then be the sum of individual $h_i$'s of the modules assigned to
        the host plus $\sum_{p=1}^{n} c_{\pi(p)}$. In terms of $W_T$ this is given by:

        $$\text{Host\_Load} = W_T - \sum_{p=1}^{n} cost(\pi(p)) + \sum_{p=1}^{n} c_{\pi(p)}.$$

$\Delta_i$   Suppose we have assigned all nodes to the host except the n sons of node i
        numbered 1..n. The value of Host_Load will then be $W_T - \sum_{k=1}^{n} (cost(k) - c_k)$. If,
        instead of assigning each son of node i to a separate satellite, node i itself is
        assigned to a satellite then the new value of Host_Load will be
        $W_T - cost(i) + c_i$. The difference between the two values of Host_Load is
        denoted by $\Delta_i$ and is equal to $\Delta_i = (cost(i) - c_i) - \sum_{k=1}^{n} (cost(k) - c_k)$. Thus if $\Delta_i$ is
        positive then the value of Host_Load will reduce by $\Delta_i$ if we assign a single
        satellite to node i (and its children) instead of assigning a separate satellite to
        each son of node i. For each leaf node i, the value of $\Delta_i$ is $cost(i) - c_i$.

The algorithm selects, as before, a trial weight w and then uses the function
PROBE_TREE (described below). The function PROBE_TREE(w) returns true if it is possible
to partition the program tree over the host-satellite system such that the load on any satellite
and the load on the host is less than or equal to w, and false otherwise. The resulting
partition, if any, is called the conservative partition for a trial weight w.

6.1. The Algorithm 

function PROBE_TREE(w): boolean;

    procedure MERGE(i, father(i));
    begin
        $e_{father(i)}$ = $e_{father(i)}$ + $e_i$;
        $h_{father(i)}$ = $h_{father(i)}$ + $h_i$;
        remove edge(i, father(i));
    end;

begin
    for level = $d_{\max}$ down to 1 do
    begin
        for each node i at depth(i) = level do
            if load(i) > w then MERGE(i, father(i));
    end;

    Host_Load = $W_T$;
    for level = $d_{\max}$ down to 1 do
    begin
        for each node i at depth(i) = level do
            if $\Delta_i \le 0$ then MERGE(i, father(i))
            else Host_Load = Host_Load - $\Delta_i$;
    end;

    if Host_Load $\le$ w then return(true) else return(false);
end.
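
The tree probe can also be sketched in Python. The sketch below is not a line-by-line transcription of the pseudocode: instead of merging edges explicitly, it computes, bottom-up, the largest reduction in host load obtainable inside each subtree, which is intended to reach the same host load. That equivalence is an assumption of this illustration. The name `probe_tree`, the parent-array layout, the requirement that every node has a larger index than its father, and the sample data are all hypothetical.

```python
def probe_tree(parent, e, h, c, w):
    """Return (feasible, host_load) for trial weight w.

    parent[i] is the father of node i (node 0 is the root and
    parent[i] < i for i >= 1). e[i], h[i] are the satellite and host
    execution costs of node i; c[i] is the cost of the edge between
    node i and its father (c[0] is unused).
    """
    m = len(parent)
    sub_e = list(e)        # satellite cost of each subtree
    sub_h = list(h)        # host cost of each subtree, i.e. cost(i)
    sons_gain = [0] * m    # best host-load reduction found below node i
    for i in range(m - 1, 0, -1):  # all descendants of i are already done
        if sub_e[i] + c[i] <= w:
            own = sub_h[i] - c[i]  # reduction if i's subtree gets one satellite
        else:
            own = float("-inf")    # load(i) > w: i cannot head a satellite
        gain = max(sons_gain[i], own)
        p = parent[i]
        sub_e[p] += sub_e[i]
        sub_h[p] += sub_h[i]
        sons_gain[p] += gain
    host_load = sub_h[0] - sons_gain[0]  # the root always stays on the host
    return host_load <= w, host_load


# Hypothetical 5-node tree: node 0 is the root, nodes 3 and 4 hang off node 1.
tree = [-1, 0, 0, 1, 1]
cost = [5, 4, 6, 3, 2]
comm = [0, 2, 1, 1, 4]
print(probe_tree(tree, cost, cost, comm, 10))
print(probe_tree(tree, cost, cost, comm, 11))
```

Each node is visited once, so a single probe is linear in the number of nodes under these assumptions.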


6.2. Discussion 

1. The function PROBE_TREE, while trying to find a conservative partition of weight w,
   will assign a node i and its children to a satellite only if $load(i) \le w$. Each node j
   for which $load(j) > w$ is therefore merged with father(j) by combining the execution cost
   $e_j$ ($h_j$) with $e_{father(j)}$ ($h_{father(j)}$) and removing the edge (j, father(j)). The problem is now
   reduced to partitioning the new program tree, in which $load(i) \le w$ for each node i other
   than the root, in such a fashion that the value of Host_Load is minimum.

2. We assume that initially all m nodes are assigned to the host and thus the initial value of
   Host_Load is equal to $W_T$. The function, while examining each leaf node i at
   depth(i) = $d_{\max}$, assigns it to a satellite if $\Delta_i > 0$. If, on the other hand, $\Delta_i \le 0$ then node i is
   merged with father(i) by combining $h_i$ with $h_{father(i)}$ and removing the edge (i, father(i)).
   In the resulting partition the value of Host_Load will reduce by an amount equal to the
   sum of individual $\Delta_i$'s for each leaf node i assigned to a satellite.

3. The function then looks at each node j at a depth one less than $d_{\max}$. Remember that in
   the previous iteration of the for loop each son of node j has either already been assigned
   to a satellite or been merged with node j. If $\Delta_j > 0$ then the value of Host_Load will
   further reduce by $\Delta_j$ if the previous partition is moved up one level by assigning a
   satellite to node j and to its children instead of keeping the previous partition in which each
   son of node j has been assigned to a separate satellite. If however $\Delta_j \le 0$ then node j is
   merged with father(j) and the previous partition is maintained (assigning node j to a
   satellite will increase the value of Host_Load instead of reducing it).

4. The function PROBE_TREE works from bottom to top in the program tree, reducing the
   value of Host_Load at each iteration of the for loop by an amount equal to the sum of
   individual $\Delta_i$'s for each node i examined during that iteration and not merged with its
   father. Thus for each node j examined during an iteration, the decision to keep the
   previous partition, or to move the partition up one level by assigning a single satellite to node
   j, is solely dependent upon node j and its sons and is not influenced by any other nodes
   examined during that iteration. The nodes merged with the root in the last iteration are
   assigned to the host and the value of Host_Load is compared with w.

5. The resulting value of Host_Load is equal to $W_T$ minus the sum of individual $\Delta_k$'s for
   each node k of the program tree which is examined by the function and not merged with
   father(k). It is important to note that the policy according to which nodes are assigned to
   satellites makes sure that the value of Host_Load reduces by a maximum amount.

6.3. An Example 

Let us now consider an example of a tree structured program consisting of 32 nodes as 
shown in Fig. 8(a). For the sake of simplicity it is assumed that the execution cost of a 
module is the same on the satellite as well as on the host and is shown inside each node in 
Fig. 8(a). The number associated with each edge is the communication cost for the two 
modules at the ends of that edge. Trial weight w=140. 

Each node i for which load(i)>140 is merged with father(i) and the edge (i, father(i)) is 
removed. The nodes merged and the edges to be removed are shown in bold in Fig. 8(b). The 
32 node program tree is thus transformed into a 28 node program tree as shown in Fig. 8(c). 

Initially all the remaining 28 nodes are assigned to the host. The value of Host_Load 
will then be equal to W_T=175. The value of Δ_i for each node i at depth(i)=5 is shown in 
bold outside each node in Fig. 8(c). The function assigns each node i to a satellite if 
Δ_i>0 and merges it with its father if Δ_i<0; the resulting partition is also shown in 
the figure. The value of Host_Load for this partition is 175-(3+9+5+4+4+5+8)=137. The 
function goes up one level and examines each node i at depth(i)=4. If Δ_i>0 then node i is 
assigned to a satellite and the previous partition is moved up one level, with the result 
that the value of Host_Load further reduces by Δ_i. The resulting partition is shown in 
Fig. 8(d) with Host_Load=137-(7+5+14+9)=102. In the third iteration of the second 
outermost for loop, the value of Host_Load is further reduced by 6+9, to 87, as shown in 
Fig. 8(e). The function then examines each node at a depth equal to 2. The value of 
Host_Load is further reduced to 87-(7+15+12)=53 with the partition shown in Fig. 8(f). In 
the last iteration 3 nodes are merged with the root and are thus assigned to the host; the 
resulting partition is shown in Fig. 8(g). The final value of Host_Load is therefore the 
same as after the previous iteration, namely 53. 
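
The arithmetic of this example can be replayed with a short Python sketch. The per-level lists are the Δ values quoted in the text for the nodes assigned to satellites; the function name `host_load_trace` and the dictionary layout are hypothetical, and the empty list for depth 1 reflects the statement that the nodes examined in the last iteration are merged with the root.

```python
# Δ values quoted in the text for nodes assigned to satellites, keyed by depth.
reductions = {
    5: [3, 9, 5, 4, 4, 5, 8],
    4: [7, 5, 14, 9],
    3: [6, 9],
    2: [7, 15, 12],
    1: [],  # nodes at depth 1 are merged with the root, so nothing is subtracted
}

def host_load_trace(w_total, reductions):
    """Return (depth, Host_Load) after each bottom-up iteration."""
    trace = []
    host_load = w_total
    for depth in sorted(reductions, reverse=True):
        host_load -= sum(reductions[depth])
        trace.append((depth, host_load))
    return trace

trace = host_load_trace(175, reductions)
print(trace)  # [(5, 137), (4, 102), (3, 87), (2, 53), (1, 53)]
```

The final value, 53, does not exceed the trial weight w=140.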

6.4. Proof of Correctness 

Claim   If a tree structured program has a partition over a host-satellite system in 
        which the worst load on any satellite and the load on the host are bounded by w, 
        then the function PROBE_TREE will find that or an equivalent assignment. 


Definitions: 

height      d_max+1 minus depth(i) for a node i. Thus the height of the root is d_max+1. 

π_w^h       a partition which guarantees that the load assigned to the host is minimum 
            provided only nodes with height≤h are permitted to reside on the satellites. 

Proof   In a tree structured program in which each node i with load(i)>w has already 
        been merged with father(i), the above claim will be true if the function 
        PROBE_TREE can find π_w^{d_max}. 

        The proof is by induction on height. 

        Consider the case height=1. The function initially assigns all m nodes to the 
        host and thus the starting value of Host_Load is W_T. During the execution of 
        the first iteration of the for loop, the function examines each leaf node i at 
        height=1 and assigns it to a satellite if Δ_i>0. If on the other hand Δ_i<0 then 
        node i is merged with father(i). In the resulting partition the value of 
        Host_Load will reduce by an amount equal to the sum of the individual Δ_i's for 
        each node i assigned to a satellite. Since each such decision is independent of 
        the others, the partition found after the first iteration will be π_w^1. 

        We will now show that if the function PROBE_TREE can find π_w^k after the 
        kth iteration then it can also find π_w^{k+1} after the (k+1)th iteration of 
        the for loop. 

        After finding π_w^k the function looks at each node i at height equal to k+1. If 
        Δ_i>0 then for node i the previous partition π_w^k is moved up one level by 
        assigning a satellite to node i and its children instead of keeping the previous 
        partition, and thus the value of Host_Load further reduces by Δ_i. If however 
        Δ_i<0 then the previous partition π_w^k is maintained. For each node i examined 
        at height=k+1, the decision to keep the previous partition or to move the 
        partition up one level is solely dependent upon Δ_i and is not influenced by any 
        other node examined during that iteration. Note that below each such node the 
        partition π_w^{k+1} will either be the previous partition π_w^k or the new 
        partition in which node i at height=k+1 is assigned to a satellite (along with 
        its children). It cannot be any other partition because, by definition, π_w^k 
        guarantees that the value of Host_Load is minimum under the constraint that only 
        those nodes with height≤k are permitted to reside on the satellites. 

        Thus if the function can find π_w^k after the kth iteration of the for loop then 
        it will be transformed into π_w^{k+1} after the (k+1)th iteration and into 
        π_w^{d_max} after the last iteration. 

The algorithm makes a binary search in the range W_T/m to W_T, and it uses the function 
PROBE_TREE to find a conservative partition of weight w for which w is minimum. For 
each trial weight w, the function PROBE_TREE has to examine each node only once to 
decide whether to assign the node and its children to a satellite or to merge it with its 
father. Thus the function performs O(m) steps to find a conservative partition of weight 
w, if it exists. If the range W_T/m to W_T is resolved to an accuracy of ε then the 
algorithm will find a conservative partition of weight w in time O(m log2(W_T/ε)), with 
the assurance that w exceeds the larger of the load on the host and the worst load on any 
satellite in the optimal assignment by no more than ε. 
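
The binary search wrapped around the probing function might be sketched in Python as follows. This is a sketch under stated assumptions rather than the paper's own procedure: `probe` is a hypothetical callable standing in for PROBE_TREE, it is assumed to be monotone (if it accepts w it accepts every larger weight), and it is assumed to accept W_T, the cost of placing every module on one processor.

```python
def min_feasible_weight(probe, w_total, m, eps):
    """Binary search over [w_total/m, w_total] for the smallest accepted weight.

    probe:   hypothetical stand-in for PROBE_TREE; returns True if a
             partition of weight w exists. Assumed monotone in w.
    w_total: W_T, assumed to be accepted by probe.
    m:       number of modules.
    eps:     desired accuracy.
    """
    lo, hi = w_total / m, w_total
    if probe(lo):
        return lo
    # Invariant: probe(lo) is False and probe(hi) is True.
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if probe(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Toy probe (an assumption for illustration): feasible exactly when w >= 53.
best = min_feasible_weight(lambda w: w >= 53, 175, 28, 0.5)
```

Each call to `probe` costs O(m) for the tree problem, and the loop halves an interval of length at most W_T until it is no longer than ε, which gives the O(m log2(W_T/ε)) bound stated above. The returned weight lies within ε of the smallest weight the probe accepts.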

7. Conclusions. 

We have discussed a number of partitioning problems in the field of parallel, pipelined 
and distributed computing. We have demonstrated that in a variety of such problems it is 
possible to design a probing function which can find out if a partition of a parallel 
program over a multiple computer system exists in which the load on any processor is less 
than or equal to a given weight w. It has been shown that for a parallel program with a 
chain or tree-like interconnection structure, the probing function provides a true/false 
answer in linear time provided the processor system is also limited to a chain of 
processors or is a host-satellite system. The optimal partition is then found 
approximately by making a binary search in a finite range to find the partition for which 
w is minimum. 

In order to extend this approach to other problems, it is essential to find an efficient 
probing function. The rule on which the probing function is based depends upon the 
nature of the partitioning problem to be solved. For example, in the partitioning of a 
one-dimensional domain over a chain of processors, the probing function was simply a 
greedy method [4], while in the case of partitioning a chain-like program over a shared 
memory system it was based on a dynamic programming approach described in [10]. Once an 
efficient probing function is found for a problem, the optimal partitioning can be found 
by making a binary search in a given range using the probing function. 
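
To make the notion of a greedy probing function concrete, here is a minimal Python sketch of the textbook greedy probe for a chain of weights. It is not claimed to be the method of [4]: it assumes that loads are additive, that communication costs are ignored, and that each processor receives a contiguous run of the chain. The name `greedy_chain_probe` is hypothetical.

```python
def greedy_chain_probe(weights, n, w):
    """Return True if the chain can be cut into at most n contiguous
    pieces, each of total weight <= w (communication costs ignored)."""
    processors_used = 1
    current = 0
    for x in weights:
        if x > w:
            return False            # a single element already exceeds w
        if current + x > w:
            processors_used += 1    # start filling the next processor
            current = x
        else:
            current += x
    return processors_used <= n

print(greedy_chain_probe([4, 3, 5, 2, 6], 3, 7))  # True
print(greedy_chain_probe([4, 3, 5, 2, 6], 3, 6))  # False
```

The probe runs in time linear in the length of the chain, so it can be called once per trial weight inside a binary search.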

Future work in this field requires that this approach be extended to multiple computer 
systems with a richer interconnection structure such as a binary tree, a hypercube or a 
mesh. It will be interesting to find out whether this approach can be efficiently applied 
to multiple computer systems composed of dissimilar processors. It also remains to be seen 
how efficiently this type of approach can be applied to a two-dimensional domain, with 
non-uniform work loads, which is to be partitioned into areas requiring equal 
computational effort [11]. 

Fig. 1 Subchain Modules[j..k] are assigned to Processor p while 
Modules[k+1..x] are assigned to Processor p+1. (The caption also states an 
inequality for the case where x is smaller than q; its symbols are not legible.) 

Fig. 2 A 10 module chain to be mapped on a 4 processor 
chain. The number below each module is its execution 
cost. The number above each edge is the communication 
cost for the two modules at the ends of that edge. 
(Panels: Module Chain; Processor Chain.) 

Fig. 3 The load on processor 1, σ_{1,x} (grey line), and the 
remaining load Δ_x (black line). Trial weight w=20. 
(Panels: Module Chain; Processor Chain. Axis label: Modules[1..x] assigned to Processor 1.) 






Fig. 4 The load on processor 2, σ_{3,x} (grey line), and 
the remaining load Δ_x (black line). 
(Panels: Module Chain; Processor Chain. Axis label: Modules[3..x] assigned to Processor 2.) 




Fig. 5 The load on processor 3, σ_{6,x} (grey line), and 
the remaining load Δ_x (black line). 
(Panels: Module Chain; Processor Chain. Axis label: Modules[6..x] assigned to Processor 3.) 

Fig. 7 (top) The two satellite chains each having 10 modules. The number 
below each module is its execution cost. The number above each 
edge is the communication cost for the two modules at the ends of 
that edge. (middle) The load on satellite 1, σ_{k,1} (grey line), and the 
remaining load on the host, Δ_1 (black line). (bottom) The load on satellite 2, 
σ_{k,2} (grey line), and the remaining load on the host, Δ_2 (black line). 

Trial weight w=AA 

Fig. 8(a) A 32 node tree-structured program to be partitioned over 
a host-satellite system. The number inside each module is the 
execution cost on the host as well as on a satellite. The number 
associated with each edge is the communication cost for the 
two modules at the ends of that edge. Trial weight w=140. 

Fig. 8(b) Each node i for which load(i)>140 is merged with father(i) 
and the edge (i, father(i)) is removed. The nodes merged and 
the edges removed are shown in bold. The 32 node program 
tree is thus transformed into a 28 node program tree. 

Fig. 8(c) The partition (shown by a grey line) is generated by the 
function when level=5. For each node i at that level the 
value of Δ_i is shown in bold below each node. The nodes 
merged are shown in bold and are above the grey line. 
Each node below the grey line is assigned to a satellite. 
Thus the net reduction in Host_Load is 3+9+5+4+4+5+8=38. 

Fig. 8(d) The partition generated by the function when level=4. 
The value of Δ_i for each node i at that level is shown 
in bold. Only those nodes which are above the grey line 
are assigned to the host. The new value of Host_Load 
is 137-(7+5+14+9)=102. 

Fig. 8(g) The final partition (shown by a black line) generated 
when level=1. The value of Host_Load=53. 